[1] 60 36 72 30 30 64 63 56 72 53
The mutate() function takes in the following arguments: the first argument is the dataframe of interest, and the second argument is a new or existing data variable that is defined in terms of other data variables.
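As a minimal sketch of the two arguments, here is mutate() on a small toy dataframe. I typed in three rows of the `expression` table from this section by hand, and I am assuming the new column is the natural log of PIK3CA_Exp, which matches the values in the table.

```r
library(dplyr)

# Toy copy of three rows of the expression table (typed in by hand)
expression = data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-001339"),
  PIK3CA_Exp = c(5.138733, 3.184280, 3.165108)
)

# First argument: the dataframe of interest.
# Second argument: a new column defined in terms of an existing column.
expression = mutate(expression, log_PIK3CA_Exp = log(PIK3CA_Exp))
expression
```

If the name on the left-hand side already exists in the dataframe, mutate() overwrites that column instead of adding a new one.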
Instead of the mutate() function, we can also create a new column or modify an existing one via the $ symbol:
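A small sketch of the $ approach, again on a hand-typed toy copy of the `expression` table. Rounding in the second step is just my example of modifying an existing column, not something the analysis requires.

```r
# Toy copy of three rows of the expression table (typed in by hand)
expression = data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-001339"),
  PIK3CA_Exp = c(5.138733, 3.184280, 3.165108)
)

# Create a new column: assign to a name that does not exist yet
expression$log_PIK3CA_Exp = log(expression$PIK3CA_Exp)

# Modify an existing column: assign to a name that already exists
expression$PIK3CA_Exp = round(expression$PIK3CA_Exp, 2)

expression
```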
expression
| ModelID | PIK3CA_Exp | log_PIK3CA_Exp |
|---|---|---|
| “ACH-001113” | 5.138733 | 1.636806 |
| “ACH-001289” | 3.184280 | 1.158226 |
| “ACH-001339” | 3.165108 | 1.152187 |
metadata
| ModelID | OncotreeLineage | Age |
|---|---|---|
| “ACH-001113” | “Lung” | 69 |
| “ACH-001289” | “CNS/Brain” | NA |
| “ACH-001339” | “Skin” | 14 |
I want to examine the relationship between OncotreeLineage and PIK3CA_Exp, so I need both columns in a single dataframe:
| ModelID | PIK3CA_Exp | log_PIK3CA_Exp | OncotreeLineage | Age |
|---|---|---|---|---|
| “ACH-001113” | 5.138733 | 1.636806 | “Lung” | 69 |
| “ACH-001289” | 3.184280 | 1.158226 | “CNS/Brain” | NA |
| “ACH-001339” | 3.165108 | 1.152187 | “Skin” | 14 |
We see that in both dataframes, the rows (observations) represent cell lines with a common column ModelID, so let’s merge these two dataframes together, using full_join():
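Here is a sketch of that merge on hand-typed toy copies of the two tables shown above. The name `merged` is the one used in the text; everything else is typed in from the tables.

```r
library(dplyr)

# Toy copies of the expression and metadata tables shown above
expression = data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-001339"),
  PIK3CA_Exp = c(5.138733, 3.184280, 3.165108),
  log_PIK3CA_Exp = c(1.636806, 1.158226, 1.152187)
)
metadata = data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-001339"),
  OncotreeLineage = c("Lung", "CNS/Brain", "Skin"),
  Age = c(69, NA, 14)
)

# Match rows on the shared ModelID column and keep every row from both tables
merged = full_join(expression, metadata, by = "ModelID")
merged
```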
Let’s take a look at the dimensions:
full_join() keeps all observations from both dataframes, matching rows on the common column defined via the by argument.
Therefore, we expect to see NA values in merged, as there are some cell lines that are not in the expression dataframe.
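To see where the NA values come from, here is a toy sketch in which the metadata has one cell line that the expression data lacks. The ID "ACH-000000" is made up for illustration and is not a real cell line from the data.

```r
library(dplyr)

expression = data.frame(
  ModelID = c("ACH-001113", "ACH-001289"),
  PIK3CA_Exp = c(5.138733, 3.184280)
)
# "ACH-000000" is a hypothetical ID that appears only in metadata
metadata = data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-000000"),
  OncotreeLineage = c("Lung", "CNS/Brain", "Skin")
)

merged = full_join(expression, metadata, by = "ModelID")

# Compare dimensions (rows, columns) before and after the merge
dim(expression)
dim(metadata)
dim(merged)

# Rows with no match in expression get NA in the expression columns
merged[is.na(merged$PIK3CA_Exp), ]
```

The merged dataframe has as many rows as there are distinct ModelID values across the two inputs, which is why it can have more rows than `expression`.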
Given xxx_join(x, y, by = "common_col"),
full_join() keeps all observations in both x and y.
left_join() keeps all observations in x.
right_join() keeps all observations in y.
inner_join() keeps observations common to both x and y.
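The four rules above can be checked on two tiny dataframes that only partly overlap. The IDs "A" to "D" and the column names `x_val` and `y_val` are made up for this sketch.

```r
library(dplyr)

# x has IDs A, B, C; y has IDs B, C, D (all hypothetical)
x = data.frame(common_col = c("A", "B", "C"), x_val = c(1, 2, 3))
y = data.frame(common_col = c("B", "C", "D"), y_val = c(20, 30, 40))

full  = full_join(x, y, by = "common_col")   # A, B, C, D
left  = left_join(x, y, by = "common_col")   # A, B, C
right = right_join(x, y, by = "common_col")  # B, C, D
inner = inner_join(x, y, by = "common_col")  # B, C

# Number of rows kept by each join
c(full = nrow(full), left = nrow(left), right = nrow(right), inner = nrow(inner))
```

Only inner_join() is free of NA values introduced by the join itself, since every row it keeps has a match on both sides.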
| ModelID | OncotreeLineage | Age |
|---|---|---|
| “ACH-001113” | “Lung” | 69 |
| “ACH-001289” | “Lung” | 23 |
| “ACH-001339” | “Skin” | 14 |
| “ACH-002342” | “Brain” | 23 |
| “ACH-004854” | “Brain” | 56 |
| “ACH-002921” | “Brain” | 67 |
Desired rows: cancer subtype.
Desired columns: mean age and number of cell lines.
| OncotreeLineage | MeanAge | Count |
|---|---|---|
| “Lung” | 46 | 2 |
| “Skin” | 14 | 1 |
| “Brain” | 48.67 | 3 |
The rows I want are described by a column. The columns I want need to be summarized from other columns.
The group_by() function returns the identical input dataframe but remembers which variable(s) have been marked as grouped.
The summarise() function returns one row for each combination of grouping variables, and one column for each of the summary statistics that you have specified.
Functions you can use for summarise() must take in a vector and return a simple data type, such as any of our summary statistics functions: mean(), median(), min(), max(), etc.
The exception is n(), which takes no arguments and returns the number of rows in each group.
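Putting group_by() and summarise() together, here is a sketch that rebuilds the summary table above from a hand-typed copy of the six-row metadata table. The names `metadata_grouped` and `metadata_summary` are my own.

```r
library(dplyr)

# Toy copy of the six-row metadata table shown above
metadata = data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-001339",
              "ACH-002342", "ACH-004854", "ACH-002921"),
  OncotreeLineage = c("Lung", "Lung", "Skin", "Brain", "Brain", "Brain"),
  Age = c(69, 23, 14, 23, 56, 67)
)

# Step 1: mark OncotreeLineage as the grouping variable (data unchanged)
metadata_grouped = group_by(metadata, OncotreeLineage)

# Step 2: one row per group, one column per summary statistic
metadata_summary = summarise(metadata_grouped, MeanAge = mean(Age), Count = n())
metadata_summary
```

The rows come back ordered by the grouping variable, so the order may differ from the table above even though the values agree. If Age contained NA values, mean(Age) would return NA for that group unless you pass na.rm = TRUE.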
When combining multiple functions in one expression, it gets harder to read:
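For instance, a nested select() and filter() call might look like the sketch below. The toy data and the choice of columns are my assumptions, picked only to show how the expression has to be read from the inside out.

```r
library(dplyr)

# Toy metadata, typed in by hand
metadata = data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-001339"),
  OncotreeLineage = c("Lung", "Lung", "Skin"),
  Age = c(69, 23, 14)
)

# Read inside-out: filter() runs first, then select() runs on its result
lung_ages = select(filter(metadata, OncotreeLineage == "Lung"), ModelID, Age)
lung_ages
```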
Or, this: 🤨
result2 = function1(function2(function3(dataframe)))
Or… 🤕
result = function1(function2(function3(dataframe, df_col4, df_col2), arg2), df_col5, arg1)
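With the pipe operator %>%, the same two expressions read left to right, in the order the functions run:
result2 = dataframe %>% function3 %>% function2 %>% function1
result = dataframe %>%
  function3(df_col4, df_col2) %>%
  function2(arg2) %>%
  function1(df_col5, arg1)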
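To check that a piped chain really matches its nested form, here is a sketch using group_by() and summarise() on toy data. The data is typed in by hand and the names `nested` and `piped` are my own. The pipe passes the left-hand result as the first argument of the next function, so the innermost call comes first in the chain.

```r
library(dplyr)

# Toy metadata, typed in by hand
metadata = data.frame(
  OncotreeLineage = c("Lung", "Lung", "Skin", "Brain", "Brain", "Brain"),
  Age = c(69, 23, 14, 23, 56, 67)
)

# Nested form: group_by() is innermost, so it runs first
nested = summarise(group_by(metadata, OncotreeLineage), MeanAge = mean(Age))

# Piped form: same steps, written in the order they run
piped = metadata %>%
  group_by(OncotreeLineage) %>%
  summarise(MeanAge = mean(Age))

# Compare the two results
all.equal(nested, piped)
```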
Rewrite the select() and filter() function composition example above using the pipe metaphor and syntax.